```{r setup, include=FALSE}
# Global knitr options. These propagate to every chunk unless overridden locally.
# - echo=TRUE      : show the code that produced each output
# - message/warning=FALSE : keep package startup chatter out of the report
# - fig.align='center', fig.width/height : sensible defaults for plots
knitr::opts_chunk$set(
  echo       = TRUE,
  message    = FALSE,
  warning    = FALSE,
  fig.align  = "center",
  fig.width  = 7,
  fig.height = 4.5
)
```

# Objectives

This practical is mostly reading and configuration, but you should try to complete the two parts labelled *Exercise:* and submit the knitted HTML and a link to your GitHub/GitLab repo via Brightspace. This practical will not be graded, but it is your chance to make sure everything works properly and is submitted in the correct format for later practicals.

In terms of specific learning objectives:

- Set up a reproducible R project with RStudio, an `.Rproj` file, and a Git repository.
- Read, manipulate, and reshape tabular data using core **tidyverse** verbs (`dplyr` + `tidyr`).
- Build layered visualisations with **ggplot2**.
- Author a literate analysis with **R Markdown** that knits to HTML from a clean session.
- Use Git to commit and push your work to GitHub/GitLab, with an appropriate `.gitignore`.
- Produce a `sessionInfo()` footer so a reader knows exactly which package versions produced your results.

These skills underpin every subsequent practical in this course!


# Why reproducibility matters in health data science

A "reproducible" analysis is one where another researcher (or future-you) can take your code and data and obtain the *same* results. In health research the stakes are higher than usual:

- Findings inform clinical decisions and policy. Errors propagate.
- Datasets are often access-controlled, so the *code* is the artefact others scrutinise most.
- Regulators and journals increasingly require analysis code as a deliverable.

A practical reproducibility checklist for lab practicals:

1. **One project = one folder = one Git repository = one `.Rproj` file.** No absolute paths like `C:/Users/me/Desktop/...`.
2. **Code, data, and outputs are separated** (e.g. `R/`, `data/`, `figures/`, `output/`).
3. **All package loads are explicit and at the top of the document.**
4. **Random operations use `set.seed()`.**
5. **Knit from a clean R session** (Session → Restart R) before submitting - this catches hidden state bugs.
6. **`sessionInfo()` is recorded at the bottom of the report.**
7. **Raw data is read-only.** Cleaning produces *new* files.
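
Item 4 is easy to check for yourself. The following is a minimal sketch (the names `draw_a` and `draw_b` and the seed value are illustrative, not part of the lab): resetting the seed before each draw makes the two draws identical, whereas omitting the second `set.seed()` call would almost certainly give different numbers.

```{r}
# Same seed, same random numbers: reset the seed before each draw
set.seed(42)
draw_a <- rnorm(3)

set.seed(42)
draw_b <- rnorm(3)

identical(draw_a, draw_b)   # TRUE: the draws are reproducible
```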

# Terminology

- **R** - the open-source statistical programming language used throughout this course.
- **RStudio** - R's most popular IDE, developed by Posit (the company formerly called RStudio). It bundles editor, console, plots, environment, and Git into one window.
- **Package (library)** - R's equivalent of a Python module. Installed once with `install.packages()` and loaded per session with `library()`.
- **Tidyverse** - a coherent collection of packages (`dplyr`, `tidyr`, `ggplot2`, `readr`, `tibble`, `purrr`, `stringr`, `forcats`, `lubridate`) that share design principles and the pipe-friendly grammar.
- **Tibble** - the tidyverse's modernised data frame. Behaves like `data.frame` but prints more cleanly, never converts strings to factors, and never partially matches column names.
- **R Markdown / Quarto** - literate programming formats: prose + executable code chunks, knitted to HTML/PDF/Word.
- **Git** - the dominant decentralised version-control system.
- **GitHub / GitLab** - hosted Git platforms. Dalhousie has its own GitLab at `gitlab.cs.dal.ca`.

# Setting up your system

## Install RStudio and Git

1. Install **R** (≥ 4.1, so the native pipe `|>` is available): <https://cran.r-project.org/>
2. Install **RStudio Desktop**: <https://posit.co/download/rstudio-desktop/> 
3. Install **Git**: <https://git-scm.com/downloads>

Verify Git from a terminal and add your details to the configuration:

```bash
git --version
git config --global user.name  "YOURNAME"
git config --global user.email "YOUREMAIL"
```

## The RStudio panes

By default RStudio has four panes (configurable in *Tools → Global Options → Pane Layout*):

- **Editor** (top-left) - where you write `.R`, `.Rmd`, and other source files.
- **Console / Terminal** (bottom-left) - interactive R, plus a system shell tab.
- **Environment / History / Git** (top-right) - current variables, command history, and Git status.
- **Files / Plots / Packages / Help / Viewer** (bottom-right) - file browser, plot output, help pages.

Quick sanity check - type the following in the Console and press Enter:

```{r}
x <- 2
x
```

You should see `x` appear in the Environment pane.

## Create an RStudio project linked to Git

The recommended workflow is **GitHub/GitLab-first**:

1. Create an empty repository on GitHub/GitLab with a sensible name, e.g. `arhds-labs` (you can add a README and an R `.gitignore` template if prompted).
2. In RStudio: *File → New Project → Version Control → Git*, paste the repo URL, choose a parent directory.
3. RStudio creates a folder containing an `.Rproj` file. **Always open that `.Rproj`** to start work - it sets the working directory automatically, which is the foundation of path reproducibility.


A sensible starter `.gitignore` for R projects:

```
.Rhistory
.RData
.Ruserdata
.Rproj.user/
*.html
*.pdf
# If data is large or non-redistributable (comments must sit on their own line):
/data/raw/
# If using renv:
/renv/library/
```

When submitting labs you must include a link to your GitHub/GitLab repository. How you grant access depends on the platform:

- Dalhousie GitLab: when you create your repository, setting the `visibility level` to `internal` makes it visible to anyone logged into `git.cs.dal.ca`, and no further configuration is needed. To limit access, set it to private, create it, and then use the left-side menu `Manage -> Members -> Invite Members` to invite my CSID `finlaym` to the repository.

- GitHub: when you create your repository, choosing `public` under `Choose visibility` makes it visible to anyone online, and no further configuration is needed. To limit access, set it to private, create it, then click on invite collaborators and add `fmaguire` to your repository.

## Aside: reproducible package management with `renv`

For coursework, plain `library()` calls are fine. For your bigger research projects, you should consider using a utility like **`renv`** (`renv::init()`) to pin exact package versions in `renv.lock` so collaborators get the same environment. 

# R fundamentals (a quick refresher)

If you have never used R, have a look at the [Harvard Chan Intro-R module](https://hbctraining.github.io/Training-modules/IntroR/) material. This section is a compressed overview of the key details.

## Vectors

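A vector is an ordered collection of values of the *same type*. R indexes from **1** (not 0) and supports negative and logical indexing. Note: negative indexing works differently than in many other languages. In R, `x[-2]` drops the second element, whereas in Python `x[-2]` returns the second-to-last element.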

```{r}
my_vector <- c(1, 4, 3, 2)
my_vector[2]              # second element
my_vector[-2]             # all elements EXCEPT the second
my_vector[2:3]            # second through third
my_vector[my_vector > 2]  # logical indexing

length(my_vector)
mean(my_vector)
sd(my_vector)

# Append
my_vector <- c(my_vector, 90)
my_vector <- c(30, my_vector)
my_vector
```
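
The "same type" rule has a practical consequence: if you mix types inside `c()`, R silently coerces everything to the most general type rather than raising an error. A small illustration (the variable names are made up for this example):

```{r}
# Mixing types inside c() triggers implicit coercion to the most general type
num_and_lgl <- c(1, TRUE)        # logical TRUE becomes the number 1
num_and_chr <- c(1, "a", TRUE)   # everything becomes character

class(num_and_lgl)   # "numeric"
class(num_and_chr)   # "character"
num_and_chr
```

This is worth remembering when a numeric column unexpectedly turns into text: a single stray string in the input is usually the cause.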

## Factors

A factor encodes a categorical variable. Levels are the allowed values; by default they are alphabetically ordered, which is rarely what you want.

```{r}
expression <- c("low", "high", "medium", "high", "low", "medium", "high")

# Default: alphabetical ordering - usually wrong for ordinal data
factor(expression)

# Specify a meaningful order
factor(expression, levels = c("low", "medium", "high"))
```

In modern tidyverse code, prefer **`forcats`** (`fct_relevel`, `fct_infreq`, `fct_lump_n`) over base R for factor manipulation.
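
As a rough sketch of what that looks like in practice (the `severity` vector is hypothetical and exists only for this example), `fct_relevel()` sets an explicit order and `fct_infreq()` orders levels by how often they occur:

```{r}
library(forcats)

# Hypothetical severity ratings, used only for illustration
severity <- factor(c("low", "high", "medium", "high", "low", "high"))

levels(severity)                                         # alphabetical default
levels(fct_relevel(severity, "low", "medium", "high"))   # explicit ordinal order
levels(fct_infreq(severity))                             # most frequent level first
```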

## Data frames and tibbles

A data frame is a rectangular table whose columns may have different types. A **tibble** is the tidyverse drop-in replacement: same idea, better defaults.

```{r}
# Base R
df_base <- data.frame(
  patient_id = c("P01", "P02", "P03"),
  age        = c(58, 64, 71),
  sex        = c("F", "M", "F"),
  sbp_mmhg   = c(132, 145, 128)   # systolic blood pressure
)
df_base

# Tibble equivalent
library(tibble)
tb <- tibble(
  patient_id = c("P01", "P02", "P03"),
  age        = c(58, 64, 71),
  sex        = c("F", "M", "F"),
  sbp_mmhg   = c(132, 145, 128)
)
tb
```

Notice the tibble prints column types (`<chr>`, `<dbl>`) - useful when debugging type coercion bugs.
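
Printing is not the only difference. The sketch below uses two throwaway objects (`df_demo` and `tb_demo`, named here only for illustration) to show two "better defaults" that tend to matter when debugging: single-column subsetting and partial name matching.

```{r}
library(tibble)

df_demo <- data.frame(age = c(58, 64), sbp_mmhg = c(132, 145))
tb_demo <- tibble(age = c(58, 64), sbp_mmhg = c(132, 145))

# Single-column subsetting: data.frame drops to a vector, tibble stays a tibble
class(df_demo[, "age"])
class(tb_demo[, "age"])

# Partial name matching with `$`: data.frame silently matches `sbp` to `sbp_mmhg`,
# whereas the tibble returns NULL (with a warning) because no column is called `sbp`
df_demo$sbp
tb_demo$sbp
```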

## The pipe

R has two pipes:

- `|>` - the **native pipe**, built into R ≥ 4.1. No package required.
- `%>%` - the **magrittr pipe**, loaded with `dplyr`/`tidyverse`. Older code uses this almost exclusively.

Both pass the left-hand side as the first argument of the right-hand side. For new code, **prefer `|>`**.

```{r}
# These three are equivalent
exp(sqrt(16))                 # nested calls - read inside-out
sqrt(16) |> exp()             # native pipe (R 4.1+)

library(magrittr, quietly = TRUE)
sqrt(16) %>% exp()            # magrittr pipe
```

The pipe lets you read data transformations left-to-right, top-to-bottom, like a recipe.
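
The benefit grows with the number of steps. A small sketch with a hypothetical `readings` vector, comparing a three-step nested call with the equivalent pipeline:

```{r}
# Hypothetical readings, used only to illustrate pipeline order
readings <- c(4, 9, 16)

nested <- round(sum(sqrt(readings)), 1)             # read inside-out
piped  <- readings |> sqrt() |> sum() |> round(1)   # read left-to-right

identical(nested, piped)   # TRUE: same computation, different reading order
```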

# The tidyverse, in one chunk

```{r}
library(tidyverse)   # loads dplyr, tidyr, ggplot2, readr, tibble, purrr, stringr, forcats, lubridate
```

This lab needs the following three packages. If any are missing, install them once from the Console like this:

```{r eval=FALSE}
install.packages(c("tidyverse", "datasauRus", "here"))
```

There is no need to run `library(dplyr)` *and* `library(tidyverse)` - the latter loads the former. Stick with `library(tidyverse)` for analysis scripts; load individual packages only when writing a package or a constrained Shiny app.

# Data manipulation with `dplyr`

`dplyr` provides a small set of verbs that compose into rich pipelines. We will use a tiny synthetic clinical dataset throughout.

```{r}
set.seed(2026)   # reproducibility: any random draws below give identical results every run

clinic <- tibble(
  patient_id = sprintf("P%03d", 1:8),
  age        = c(58, 64, 71, 49, 82, 33, 67, 55),
  sex        = c("F", "M", "F", "M", "F", "M", "F", "M"),
  smoker     = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  sbp_mmhg   = c(132, 145, 128, 118, 162, 121, 150, 135),   # systolic BP
  bmi        = c(27.4, 31.2, 24.8, 22.0, 29.5, 21.3, 33.1, 26.6)
)
clinic
```

## `select()` - pick columns

```{r}
clinic |> select(patient_id, age, sbp_mmhg)

# Helpers
clinic |> select(starts_with("s"))
clinic |> select(where(is.numeric))
```

Python equivalent: `df[["patient_id", "age", "sbp_mmhg"]]` or `df.select_dtypes(include="number")`.

## `filter()` - pick rows

```{r}
# Hypertensive smokers
clinic |> filter(sbp_mmhg >= 140, smoker)

# Logical OR
clinic |> filter(age >= 65 | bmi >= 30)
```

Python equivalent: `df.query("sbp_mmhg >= 140 and smoker")`.

## `mutate()` - create or modify columns

```{r}
clinic |>
  mutate(
    bp_category = case_when(
      sbp_mmhg <  120 ~ "normal",
      sbp_mmhg <  130 ~ "elevated",
      sbp_mmhg <  140 ~ "stage 1",
      TRUE            ~ "stage 2"
    ),
    obese = bmi >= 30
  )
```

`case_when()` is the multi-branch `if/else` of the tidyverse - much cleaner than nesting `ifelse()` calls.
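
One thing to watch with a catch-all `TRUE ~ ...` branch is missing data: a comparison against `NA` is neither true nor false, so an `NA` reading falls through to the last branch and would be labelled "stage 2". A minimal sketch of one way to guard against that, using a hypothetical `sbp_demo` vector:

```{r}
library(dplyr)

# Hypothetical readings; the last one is missing
sbp_demo <- c(118, 135, NA)

bp_demo <- case_when(
  is.na(sbp_demo) ~ NA_character_,   # handle missing values before the catch-all
  sbp_demo < 120  ~ "normal",
  sbp_demo < 130  ~ "elevated",
  sbp_demo < 140  ~ "stage 1",
  TRUE            ~ "stage 2"
)
bp_demo
```

Branches are evaluated in order, so the `is.na()` check has to come first.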

## `arrange()` - sort

```{r}
clinic |> arrange(desc(sbp_mmhg))
```

## `summarise()` and `group_by()` - collapse rows

```{r}
clinic |>
  group_by(sex) |>
  summarise(
    n            = n(),
    mean_age     = mean(age),
    mean_sbp     = mean(sbp_mmhg),
    pct_smokers  = mean(smoker) * 100,
    .groups      = "drop"
  )
```

Python equivalent: `df.groupby("sex").agg(...)`.
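
Two shortcuts are worth knowing once the `group_by()` + `summarise()` pattern feels familiar. The sketch below uses a hypothetical `visits` tibble and assumes dplyr ≥ 1.1.0 for the `.by` argument; on older versions, stick with `group_by()`.

```{r}
library(dplyr)
library(tibble)

# Hypothetical mini-dataset, used only for illustration
visits <- tibble(
  sex      = c("F", "M", "F", "M"),
  sbp_mmhg = c(132, 145, 128, 118)
)

# count() is shorthand for group_by() + summarise(n = n())
visits |> count(sex)

# Per-operation grouping (assumes dplyr >= 1.1.0); the result is never grouped,
# so there is no need for .groups = "drop"
by_sex <- visits |> summarise(mean_sbp = mean(sbp_mmhg), .by = sex)
by_sex
```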

## `across()` - apply a function to multiple columns

`across()` (introduced in dplyr 1.0) is the modern way to compute the same summary for many columns:

```{r}
clinic |>
  group_by(sex) |>
  summarise(across(c(age, sbp_mmhg, bmi), \(x) mean(x, na.rm = TRUE)),
            .groups = "drop")
```

The `\(x) ...` syntax is R 4.1's anonymous-function shorthand (equivalent to `function(x) ...` in R, or `lambda x: ...` in Python).

## Other verbs worth knowing

- `rename(new = old)` - rename columns.
- `relocate(col, .before = other)` - reorder columns.
- `distinct()` - drop duplicate rows.
- `slice_max(col, n = 5)` / `slice_min()` / `slice_sample(n = 100)` - pick rows by rank or randomly.
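
A short sketch chaining these verbs together. The `vitals_raw` tibble is hypothetical and deliberately contains one duplicated row:

```{r}
library(dplyr)
library(tibble)

# Hypothetical data with a duplicated row for patient "B"
vitals_raw <- tibble(
  id  = c("A", "B", "B", "C"),
  sbp = c(120, 150, 150, 135),
  age = c(40, 60, 60, 50)
)

vitals_clean <- vitals_raw |>
  distinct() |>                          # drop the exact duplicate
  rename(sbp_mmhg = sbp) |>              # new name = old name
  relocate(age, .before = sbp_mmhg)      # move age ahead of the BP column

vitals_clean

# The single highest systolic reading
vitals_clean |> slice_max(sbp_mmhg, n = 1)
```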

# Reshaping data with `tidyr`

**Tidy data** has three properties:

1. Each variable is a column.
2. Each observation is a row.
3. Each cell is a single value.

Most messy datasets violate at least one of these. Two verbs do most of the work:

## `pivot_longer()` - wide → long

(`pivot_longer()` replaces the older `gather()`. You may still see `gather()` in older code; it works but is no longer recommended.)

```{r}
life_expectancy <- tribble(
  ~country,    ~`2010`, ~`2015`, ~`2020`,
  "Australia",  82.0,    82.4,    83.0,
  "Canada",     80.7,    81.5,    81.9,
  "France",     81.8,    82.3,    83.0
)
life_expectancy

le_long <- life_expectancy |>
  pivot_longer(
    cols      = -country,        # everything except country
    names_to  = "year",
    values_to = "expectancy"
  ) |>
  mutate(year = as.integer(year))

le_long
```
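
Column names are always text, which is why the year has to be converted after pivoting. As an alternative sketch, `pivot_longer()` can do the conversion itself through its `names_transform` argument. The `wide_demo` table and its values are invented for this example:

```{r}
library(tidyr)
library(tibble)

# Hypothetical counts, used only for illustration
wide_demo <- tribble(
  ~site, ~`2019`, ~`2020`,
  "A",    10,      12,
  "B",    20,      18
)

long_demo <- wide_demo |>
  pivot_longer(
    cols            = -site,
    names_to        = "year",
    values_to       = "cases",
    names_transform = list(year = as.integer)   # convert while pivoting
  )

long_demo
```

Both approaches give an integer `year` column; which one reads better is largely a matter of taste.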

## `pivot_wider()` - long → wide

(`pivot_wider()` replaces `spread()`.)

```{r}
le_long |>
  pivot_wider(names_from = year, values_from = expectancy)
```

## Other useful `tidyr` verbs

```{r}
# Split one column into many (separate() still works, but tidyr >= 1.3.0
# supersedes it with separate_wider_delim())
tibble(x = c("a_1", "b_2")) |>
  separate(x, into = c("letter", "number"), sep = "_")

# Carry the last observation forward (e.g. visit dates)
tibble(visit = c(1, NA, NA, 4)) |>
  fill(visit)

# Drop rows with any missing
tibble(x = c(1, 2, NA), y = c(3, NA, 5)) |>
  drop_na()

# Replace specific NAs
tibble(x = c(1, 2, NA), y = c(3, NA, 5)) |>
  replace_na(list(x = 0, y = 99))
```

# Reading and writing data with `readr` and `here`

Hard-coded paths break reproducibility. The `here` package resolves paths relative to the project root (the folder containing your `.Rproj`):

```{r eval=FALSE}
library(here)

# Write the clinic tibble to data/processed/
dir.create(here("data", "processed"), recursive = TRUE, showWarnings = FALSE)
write_csv(clinic, here("data", "processed", "clinic.csv"))

# Read it back - works on any machine, regardless of where the project lives
clinic2 <- read_csv(here("data", "processed", "clinic.csv"))
```

`readr` functions (`read_csv`, `read_tsv`, `read_delim`) are typically faster than base R's `read.csv`, return tibbles, and never convert strings to factors (base R's `read.csv` did so by default before R 4.0).
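
By default `read_csv()` guesses each column's type from the data. For a reproducible analysis it can help to state the types explicitly, so that a change in the input file produces a visible parsing problem rather than a silently different column type. A minimal sketch, writing a hypothetical two-row file to a temporary location so that it runs anywhere:

```{r}
library(readr)

# Write a tiny hypothetical file to a temporary location
tmp_csv <- tempfile(fileext = ".csv")
write_csv(
  data.frame(patient_id = c("P001", "P002"),
             age        = c(58, 64),
             smoker     = c(FALSE, TRUE)),
  tmp_csv
)

# Declare the expected type of every column instead of relying on guessing
clinic_typed <- read_csv(
  tmp_csv,
  col_types = cols(
    patient_id = col_character(),
    age        = col_double(),
    smoker     = col_logical()
  )
)
clinic_typed
```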

# Visualisation with `ggplot2`

`ggplot2` implements the *Grammar of Graphics*: every plot is a stack of layers built from **data + aesthetic mappings + geometric objects + scales + coordinate system + theme**.

```{r}
data(mpg)   # fuel-economy dataset that ships with ggplot2

# A plot is built up with `+`, NOT the pipe.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```

Add aesthetic mappings - colour by class:

```{r}
ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point(alpha = 0.7) +
  labs(
    x      = "Engine displacement (L)",
    y      = "Highway MPG",
    colour = "Vehicle class",
    title  = "Larger engines deliver lower fuel economy"
  ) +
  theme_minimal()
```

Common geoms:

```{r}
# Bar chart of counts
ggplot(mpg, aes(x = class)) + geom_bar() + theme_minimal()

# Histogram
ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20) + theme_minimal()

# Density
ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.4) +
  theme_minimal()

# Faceting - small multiples
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class) +
  theme_minimal()
```
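
Plots shown in a knitted report are enough for this practical, but for a project that keeps outputs in a `figures/` folder you may want to save a plot to disk explicitly. A sketch using `ggsave()`; the object name `p_demo` and the file name are illustrative, and the file goes to a temporary directory so the example has no side effects on your project:

```{r}
library(ggplot2)

# Build the plot object first, then save it explicitly
p_demo <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_minimal()

# tempdir() keeps this example side-effect free; in a real project you might
# use something like here::here("figures", "displ_vs_hwy.png") (hypothetical path)
out_file <- file.path(tempdir(), "displ_vs_hwy.png")
ggsave(out_file, plot = p_demo, width = 7, height = 4.5, dpi = 300)

file.exists(out_file)
```

Passing `plot =`, `width`, and `height` explicitly avoids depending on whichever plot was drawn last or on the size of the Plots pane.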

# Exercise: Datasaurus

This exercise demonstrates *why we visualise data*: thirteen datasets with nearly identical summary statistics but wildly different shapes.

```{r}
library(datasauRus)

datasaurus_dozen |>
  count(dataset)
```

The original Datasaurus is from Alberto Cairo's [blog post](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html); the rest are from Matejka & Fitzmaurice's *Same Stats, Different Graphs* (CHI 2017).

**Q1.** How many rows and columns does `datasaurus_dozen` contain, and what are the variables?

**Q2.** Uncomment and complete the ggplot code to plot `y` vs `x` for the `dino` subset, and compute the Pearson correlation.

```{r}
dino_data <- datasaurus_dozen |>
  filter(dataset == "dino")

#ggplot(dino_data, aes( # complete...

dino_data |> summarise(r = cor(x, y))
```

**Q3.** Repeat for the `star` dataset. Compare its `r` to that of `dino`.

**Q4.** Repeat for the `circle` dataset. Compare its `r` to that of `dino`.

**Q5.** Complete the following code to visualise all the datasets at once with faceting and calculate the statistics as a single grouped `summarise` command.

```{r fig.width=8, fig.height=8}
ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point(size = 0.7) +
  facet_wrap(~ dataset, ncol = 3) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text       = element_blank(),
        axis.ticks      = element_blank())

datasaurus_dozen |>
  group_by(dataset) |>
  summarise(
    mean_x = mean(x),
    #... complete this to calculate the mean for y, standard deviation for x and y, and Pearson correlation
    .groups = "drop"
  )
```

**Q6.** Write 2–3 sentences in your knitted document on why these summary statistics are nearly identical despite the obvious visual differences, and what this implies for exploratory data analysis on real clinical datasets.

# Exercise: Air Quality Data

R ships with `airquality` (daily air-quality measurements in New York, May to September 1973). Despite its age it is a useful, mildly messy dataset for practising the verbs above.

**Q7. Using `airquality`:**

(a) Drop rows where `Ozone` is missing.
(b) Add a column `month_name` with the abbreviated month name (`"May"`, `"Jun"`, …). Hint: `month.abb` is a built-in vector.
(c) Compute mean `Ozone`, mean `Temp`, and the count of complete days per month.
(d) Plot `Ozone` against `Temp`, coloured by month, with a linear trend line (`geom_smooth(method = "lm")`).

```{r}
aq <- as_tibble(airquality)

#...
```

# Git workflow for this practical

A minimal commit cycle from the RStudio Terminal pane (or a system shell):

```bash
git status                               # what changed?
git add lab0_reproducible_research_tidyverse.Rmd
git commit -m "Lab 0: complete tidyverse + ggplot exercises"
git push
```

You can also use the **Git tab** in RStudio's top-right pane: tick the files to stage, click *Commit*, write a message, click *Push*.

Best practice:

- Commit early and often. Small commits are easier to review and revert.
- Write present-tense, imperative messages ("Add Q5 facet plot", not "Added the plot").
- **Don't commit your knitted HTML or PDFs** - they are derived artefacts. The `.gitignore` above already excludes them.

# Reproducibility footer

Always end an analysis report with a record of the environment that produced it. If your collaborator reports different results, this is the first thing to compare.

```{r}
sessionInfo()
```

# Submission

For each practical you will submit:

1. The **knitted HTML** of your R Markdown notebook to Brightspace. Knit from a *clean session* (Session → Restart R → Knit).
2. A **link to the source `.Rmd` in your Git repository** (public, or shared with `github:fmaguire` / `gitlab.cs.dal.ca:finlaym` if private - see explanation above).

Due **midnight before the next week's practical**.

# Optional further reading

- Wickham, Çetinkaya-Rundel & Grolemund - [*R for Data Science (2e)*](https://r4ds.hadley.nz/) - the canonical tidyverse reference, free online.
- [R Markdown cheatsheet](https://rstudio.github.io/cheatsheets/rmarkdown.pdf) and [Markdown cheatsheet](https://www.markdownguide.org/cheat-sheet/).
- Bryan, J. - [*Happy Git and GitHub for the useR*](https://happygitwithr.com/) - the friendliest Git-for-R guide that exists.
- The Turing Way - [*Guide for Reproducible Research*](https://book.the-turing-way.org/reproducible-research/reproducible-research).